>> What is Byte-Pair Encoding (BPE)?

   Byte-Pair Encoding (BPE) was initially a method to compress text files. OpenAI later adapted it for tokenizing text when training models like GPT. BPE is used in many Transformer models, such as GPT, RoBERTa, BART, and DeBERTa. It works by merging the most frequent pairs of characters in a text to build a vocabulary of subword tokens.

>> How BPE Training Works

   1. Start with Basic Symbols:
      - Begin by breaking down the text into its smallest components, usually characters.
      - For example, if our text includes the words `"hug"`, `"pug"`, `"pun"`, `"bun"`, and `"hugs"`, our initial set of characters might be `["b", "g", "h", "n", "p", "s", "u"]`.

   2. Build the Vocabulary:
      - Create a vocabulary with all characters (and potentially some special tokens).
      - Add new tokens based on the most common character pairs.

   3. Merging Tokens:
      - Start with single characters and progressively merge the most frequent pairs.
      - For instance, if `"u"` and `"g"` frequently appear together, they get merged into `"ug"`.
      - Continue merging until the desired vocabulary size is achieved.

>> Example Process

   1. Initial Tokenization:
      - Words are split into characters.
      - Example: `"hug"` becomes `["h", "u", "g"]`.

   2. Find Frequent Pairs:
      - Identify which pairs of characters appear most often.
      - Example: If `"u"` and `"g"` are the most frequent pair, merge them into `"ug"`.

   3. Update Vocabulary and Corpus:
      - Add the new token to the vocabulary.
      - Replace the pair in the corpus with the new token.
      - Repeat until the vocabulary is complete.

>> Code Example

   Here's a simplified version of how you might implement BPE in code:

   1. Create a Corpus:
   
   python code

   '''
   corpus = [
       "This is the Hugging Face Course.",
       "This chapter is about tokenization.",
       "This section shows several tokenizer algorithms.",
       "Hopefully, you will be able to understand how they are trained and generate tokens.",
   ]
   '''

   2. **Pre-Tokenize the Corpus:**
   
   python code

   '''
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("gpt2")
   '''

   3. **Compute Word Frequencies:**

   python code

   '''
   from collections import defaultdict

   word_freqs = defaultdict(int)
   for text in corpus:
       words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
       new_words = [word for word, offset in words_with_offsets]
       for word in new_words:
           word_freqs[word] += 1
   print(word_freqs)
   '''

   4. Create Base Vocabulary:

   python code

   '''
   alphabet = []
   for word in word_freqs.keys():
       for letter in word:
           if letter not in alphabet:
               alphabet.append(letter)
   alphabet.sort()
   print(alphabet)
   '''

   5. Compute Pair Frequencies:

   python code

   '''
   def compute_pair_freqs(splits):
       pair_freqs = defaultdict(int)
       for word, freq in word_freqs.items():
           split = splits[word]
           if len(split) == 1:
               continue
           for i in range(len(split) - 1):
               pair = (split[i], split[i + 1])
               pair_freqs[pair] += freq
       return pair_freqs
   '''

   6. Merge Most Frequent Pair:

   python code

   '''
   def merge_pair(a, b, splits):
       for word in word_freqs:
           split = splits[word]
           if len(split) == 1:
               continue
           i = 0
           while i < len(split) - 1:
               if split[i] == a and split[i + 1] == b:
                   split = split[:i] + [a + b] + split[i + 2:]
               else:
                   i += 1
           splits[word] = split
       return splits
   '''

   7. Train BPE:

   python code

   '''
   vocab = [""] + alphabet.copy()
   splits = {word: [c for c in word] for word in word_freqs.keys()}
   merges = {}

   vocab_size = 50
   while len(vocab) < vocab_size:
       pair_freqs = compute_pair_freqs(splits)
       best_pair = max(pair_freqs, key=pair_freqs.get)
       splits = merge_pair(*best_pair, splits)
       merges[best_pair] = best_pair[0] + best_pair[1]
       vocab.append(best_pair[0] + best_pair[1])

   print(merges)
   print(vocab)
   '''

   8. Tokenize New Text:


   python code

   '''
   def tokenize(text):
       pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
       pre_tokenized_text = [word for word, offset in pre_tokenize_result]
       splits = [[l for l in word] for word in pre_tokenized_text]
       for pair, merge in merges.items():
           for idx, split in enumerate(splits):
               i = 0
               while i < len(split) - 1:
                   if split[i] == pair[0] and split[i + 1] == pair[1]:
                       split = split[:i] + [merge] + split[i + 2:]
                   else:
                       i += 1
               splits[idx] = split
       return sum(splits, [])

   print(tokenize("This is not a token."))
   '''

This implementation demonstrates the key steps of BPE, from building the initial vocabulary to merging frequent pairs and tokenizing new text.